Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor(perf): Lazy rewrite of schemapi.py, improve docs #3547

Draft
wants to merge 111 commits into
base: main
Choose a base branch
from

Conversation

dangotbanned
Copy link
Member

@dangotbanned dangotbanned commented Aug 19, 2024

Related #3545 (comment)

Description

I went down a few too many rabbit holes in this PR.

This comment has an overview of most of the changes.

Misc

Removed this in 8002dab (#3547) but if anyone was curious:

Related

Benchmarking code

altair.tests.utils.test_schemapi.py

...

_SKIP_SLOW_BENCHMARKS: bool = True
_REPEAT_TIMES = 1000


@pytest.mark.parametrize("to_or_from", ["to_dict-validate", "to_dict", "from_dict"])
@pytest.mark.filterwarnings("ignore:.*:UserWarning")
@pytest.mark.skipif(
    _SKIP_SLOW_BENCHMARKS,
    reason="Should only be run in isolation to test single threaded performance.",
)
def test_chart_validation_benchmark(
    to_or_from: Literal["to_dict-validate", "to_dict", "from_dict"],
) -> None:
    """
    Intended to isolate `Chart.(to|from)_dict.` calls.

    Repeated ``_REPEAT_TIMES`` times, non-parametric:
    - in an attempt to limit the potential overhead of ``pytest``
    - but enforce ``1`` thread, like a user-code would be.

    Results
    -------
    ```
    _REPEAT_TIMES = 1000
    pytest -k test_chart_validation_benchmark  --numprocesses=3 --durations=3 tests

    # Pre-`SchemaBase.from_dict` refactor (3.12.3)
    108.16s call     tests/utils/test_schemapi.py::test_chart_validation_benchmark[to_dict-validate]
    84.62s call     tests/utils/test_schemapi.py::test_chart_validation_benchmark[from_dict]
    66.71s call     tests/utils/test_schemapi.py::test_chart_validation_benchmark[to_dict]

    # Post-`SchemaBase.from_dict` refactor (3.12.3)
    107.84s call     tests/utils/test_schemapi.py::test_chart_validation_benchmark[to_dict-validate]
    50.43s call     tests/utils/test_schemapi.py::test_chart_validation_benchmark[from_dict]
    67.07s call     tests/utils/test_schemapi.py::test_chart_validation_benchmark[to_dict]

    # Post-`SchemaBase.__init_subclass__` addition (3.12.3)
    108.24s call     tests/utils/test_schemapi.py::test_chart_validation_benchmark[to_dict-validate]
    50.33s call     tests/utils/test_schemapi.py::test_chart_validation_benchmark[from_dict]
    66.51s call     tests/utils/test_schemapi.py::test_chart_validation_benchmark[to_dict]

    # Post-`dict` branch micro optimization in `_FromDict.from_dict` (3.12.3)
    107.90s call     tests/utils/test_schemapi.py::test_chart_validation_benchmark[to_dict-validate]
    49.63s call     tests/utils/test_schemapi.py::test_chart_validation_benchmark[from_dict]
    66.87s call     tests/utils/test_schemapi.py::test_chart_validation_benchmark[to_dict]
    ```
    """
    from itertools import chain, repeat

    if TYPE_CHECKING:
        from altair.typing import ChartType

    def _iter_charts() -> Iterator[ChartType]:
        """
        Ensures only len(chart_funcs_error_message) actual charts are constructed.

        The `to_dict` calls are what gets multiplied
        """
        charts: list[ChartType] = [fn() for fn, _ in chart_funcs_error_message]
        yield from chain.from_iterable(repeat(charts, times=_REPEAT_TIMES))

    def _iter_chart_factory() -> Iterator[ChartType]:
        """
        Validation not the bottleneck, but encode is.

        Ensures at least `times` * len(chart_funcs_error_message) .encode calls are made.
        """
        chart_funcs: list[Callable[[], ChartType]] = [
            fn for fn, _ in chart_funcs_error_message
        ]
        for fn in chain.from_iterable(repeat(chart_funcs, times=_REPEAT_TIMES)):
            yield fn()

    def _to_dict(validate: bool) -> None:
        if validate:
            for chart in _iter_charts():
                with pytest.raises(schemapi.SchemaValidationError):
                    chart.to_dict(validate=validate)
        else:
            for chart in _iter_charts():
                chart.to_dict(validate=validate)

    if to_or_from == "to_dict":
        _to_dict(validate=False)
    elif to_or_from == "to_dict-validate":
        _to_dict(validate=True)
    else:
        assert list(_iter_chart_factory())

Every call was identical as it was based on an existing constant `jsonschema_version_str`
I'm going to do this a lot.
Docstrings can be collapsed in all editors and can benefit from markdown.

Everything here is already private, so using long comments has no benefit
`typeshed` disagrees with `jsonschema`, this is just enforcing what `jsonschema` says is true
Produces the same result, but skips the upfront `deepcopy`.
No longer modifying the copy inplace, new objects are created inside the iterator.
Previously, using a version below `4.0.1` would still always check first if there was a property. This would not change between checks.

Defining in this style removes the need for as much documentation, since the version guards are very clear when each branch is used.
First non-failing version.
Have left most of the original code in. Planning to migrate & adapt the comments before removing.

#
@dangotbanned
Copy link
Member Author

Pulled out alternative error message idea from #3547 (comment) out for later discussion.

The issue this was attached to has been resolved, but I still think this is an area the SchemaValidationError message could be refined.


Note

Originally added to use like below.
However this was naive, due to this skipping disjoint enums.

from functools import reduce

def _lazy_enum(iterable: Iterable[ValidationError], /) -> Iterator[ValidationError]:
    yield reduce(_enum_inner, iterable)

Possible alternative

Rather than preserving each enum, we could just extract the union of them.

It would produce a different error message than the original, so for the below:

chart_error_example__invalid_timeunit_value,
r"""'invalid_value' is an invalid value for `timeUnit`. Valid values are:
- One of \['year', 'quarter', 'month', 'week', 'day', 'dayofyear', 'date', 'hours', 'minutes', 'seconds', 'milliseconds'\]
- One of \['utcyear', 'utcquarter', 'utcmonth', 'utcweek', 'utcday', 'utcdayofyear', 'utcdate', 'utchours', 'utcminutes', 'utcseconds', 'utcmilliseconds'\]
- One of \['yearquarter', 'yearquartermonth', 'yearmonth', 'yearmonthdate', 'yearmonthdatehours', 'yearmonthdatehoursminutes', 'yearmonthdatehoursminutesseconds', 'yearweek', 'yearweekday', 'yearweekdayhours', 'yearweekdayhoursminutes', 'yearweekdayhoursminutesseconds', 'yeardayofyear', 'quartermonth', 'monthdate', 'monthdatehours', 'monthdatehoursminutes', 'monthdatehoursminutesseconds', 'weekday', 'weekdayhours', 'weekdayhoursminutes', 'weekdayhoursminutesseconds', 'dayhours', 'dayhoursminutes', 'dayhoursminutesseconds', 'hoursminutes', 'hoursminutesseconds', 'minutesseconds', 'secondsmilliseconds'\]
- One of \['utcyearquarter', 'utcyearquartermonth', 'utcyearmonth', 'utcyearmonthdate', 'utcyearmonthdatehours', 'utcyearmonthdatehoursminutes', 'utcyearmonthdatehoursminutesseconds', 'utcyearweek', 'utcyearweekday', 'utcyearweekdayhours', 'utcyearweekdayhoursminutes', 'utcyearweekdayhoursminutesseconds', 'utcyeardayofyear', 'utcquartermonth', 'utcmonthdate', 'utcmonthdatehours', 'utcmonthdatehoursminutes', 'utcmonthdatehoursminutesseconds', 'utcweekday', 'utcweekdayhours', 'utcweekdayhoursminutes', 'utcweekdayhoursminutesseconds', 'utcdayhours', 'utcdayhoursminutes', 'utcdayhoursminutesseconds', 'utchoursminutes', 'utchoursminutesseconds', 'utcminutesseconds', 'utcsecondsmilliseconds'\]
- One of \['binnedyear', 'binnedyearquarter', 'binnedyearquartermonth', 'binnedyearmonth', 'binnedyearmonthdate', 'binnedyearmonthdatehours', 'binnedyearmonthdatehoursminutes', 'binnedyearmonthdatehoursminutesseconds', 'binnedyearweek', 'binnedyearweekday', 'binnedyearweekdayhours', 'binnedyearweekdayhoursminutes', 'binnedyearweekdayhoursminutesseconds', 'binnedyeardayofyear'\]
- One of \['binnedutcyear', 'binnedutcyearquarter', 'binnedutcyearquartermonth', 'binnedutcyearmonth', 'binnedutcyearmonthdate', 'binnedutcyearmonthdatehours', 'binnedutcyearmonthdatehoursminutes', 'binnedutcyearmonthdatehoursminutesseconds', 'binnedutcyearweek', 'binnedutcyearweekday', 'binnedutcyearweekdayhours', 'binnedutcyearweekdayhoursminutes', 'binnedutcyearweekdayhoursminutesseconds', 'binnedutcyeardayofyear'\]
- Of type 'object'$""",

We could adapt the formatting from SchemaValidationError._format_params_as_table to print something like this:

Large Screenshot

image

Which would just read as:

'invalid_value' is an invalid value for timeUnit. Valid values are:

  • One of [...]
  • Of type 'object'

This would also be easier to control the formatting than relying on the groups provided by the enums alone

Note to self: Fill out comments re individual changes.
Temporary, will remove before review.
Tried to isolate to a single function so that I can reproduce on main
I renamed this from `_SLOW_BENCHMARKS` but forgot to invert the bool lol
@dangotbanned
Copy link
Member Author

dangotbanned commented Sep 3, 2024

Edit

See narwhals-dev/narwhals#905

Bug report unrelated to PR

@MarcoGorelli do you have any ideas on what is causing this error for ibis on python==3.8?

https://github.com/vega/altair/actions/runs/10681007191/job/29608118357?pr=3547

I did a merge update from the GitHub UI, but haven't been able to track down the cause yet.

Updated

Seems to be this commit which causes the issue (only for 3.8) narwhals-dev/narwhals@3d246b7

..\..\..\AppData\Local\hatch\env\virtual\altair\CXM7NV9I\hatch-test.py3.8\lib\site-packages\narwhals\dependencies.py:128: in is_ibis_table
    return bool((ibis := get_ibis()) is not None and isinstance(df, ibis.Table))
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _  

name = 'Table'

    def __getattr__(name: str) -> BaseBackend:
        """Load backends in a lazy way with `ibis.<backend-name>`.

        This also registers the backend options.

        Examples
        --------
        >>> import ibis
        >>> con = ibis.sqlite.connect(...)

        When accessing the `sqlite` attribute of the `ibis` module, this function
        is called, and a backend with the `sqlite` name is tried to load from
        the `ibis.backends` entrypoints. If successful, the `ibis.sqlite`
        attribute is "cached", so this function is only called the first time.
        """
        entry_points = {ep for ep in util.backend_entry_points() if ep.name == name}

        if not entry_points:
            msg = f"module 'ibis' has no attribute '{name}'. "
            if name in _KNOWN_BACKENDS:
                msg += f"""If you are trying to access the '{name}' backend,
                        try installing it first with `pip install ibis-{name}`"""
>           raise AttributeError(msg)
E           AttributeError: module 'ibis' has no attribute 'Table'.

..\..\..\AppData\Local\hatch\env\virtual\altair\CXM7NV9I\hatch-test.py3.8\lib\site-packages\ibis\__init__.py:58: AttributeError

Used in `Chart.from_dict`
Comment on lines +75 to +86
__all__ = [
"Optional", # altair.utils
"SchemaBase", # altair.vegalite.v5.schema.core
"Undefined", # altair.utils
"UndefinedType", # altair.vegalite.v5.schema.core -> (side-effect relied on to propagate to alt.__init__)
"_is_valid", # altair.vegalite.v5.api
"_resolve_references", # tools.schemapi.utils -> tools.generate_schema_wrapper
"_subclasses", # altair.vegalite.v5.schema.core
"is_undefined", # altair.typing
"validate_jsonschema", # altair.utils.display
"with_property_setters", # altair.vegalite.v5.schema.channels
]
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Keeping track of the refs here, since you can't follow the references from the tools copy

All were notes added earlier in PR, but not needed now
WIll put this on another branch, won't improve performance so not relevant here

vega#3547 (comment)
Feeling I've squeezed out all the performance I can for now.
Will add in a collpased comment on the PR for reference
@dangotbanned dangotbanned changed the title perf(DRAFT): Lazy schema validation & error messages refactor(perf): Lazy rewrite of schemapi, improve docs Sep 3, 2024
@dangotbanned dangotbanned changed the title refactor(perf): Lazy rewrite of schemapi, improve docs refactor(perf): Lazy rewrite of schemapi.py, improve docs Sep 3, 2024
@dangotbanned dangotbanned marked this pull request as ready for review September 3, 2024 16:31
@binste
Copy link
Contributor

binste commented Sep 4, 2024

Quite the PR!! Will take me a while to dig through it but thanks for going down all these rabbit holes, hope you came out unharmed! ;) I like jsonschema validation but it also can be a wild place, especially with the mapping of the schema to Altair objects...

Do I understand this correct that the main motivation for this PR is to improve the performance and that the main impact on performance is that from_dict is considerably faster based on your benchmark (which is super cool!), where as there is no change in to_dict and the validation?

@dangotbanned
Copy link
Member Author

dangotbanned commented Sep 4, 2024

Quite the PR!! Will take me a while to dig through it but thanks for going down all these rabbit holes, hope you came out unharmed! ;) I like jsonschema validation but it also can be a wild place, especially with the mapping of the schema to Altair objects...

Thanks and yeah no rush @binste - I know there is a lot to go through - even after all the stuff I removed over the past few days.

Do I understand this correct that the main motivation for this PR is to improve the performance and that the main impact on performance is that from_dict is considerably faster based on your benchmark (which is super cool!), where as there is no change in to_dict and the validation?

So it is tricky.

I found it quite difficult to benchmark this, since there are so many interconnected parts.
I can confidently say from_dict is faster, but of the three in the benchmark that is the only one which benefitted.

We also have much better performance on 3.8-3.11 with changes like 933c045 (#3547).
All versions are pretty much equal now on the second run of hatch test --all.


I would like to - in a future PR - develop some more isolated benchmarking tools.
Ideally, these would be based on specs that have proved to be slow in real-world use, so that we can more easily get to the root cause

@binste
Copy link
Contributor

binste commented Sep 22, 2024

I very much appreciate the time and effort you put into this and I believe some/many/maybe all changes here could be great improvements to Altair. However, as it's such a large and complex PR, I'm struggling to review this PR to an extent where I'd feel comfortable approving it.

Although improvements to the performance of from_dict are great, it's a feature that is not used by many users. I'm worried that the time someone would need to invest to review this PR could be better used elsewhere as well as the added complexity to the codebase which is a lot (it's hard to find developers familiar enough with jsonschema which want to work on Altair, the more complicated we make this part, the harder this gets).

Would you be willing to split this PR into multiple ones? They could be put up for review and merged one after the other with this PR (in draft mode) being the resource to cherry-pick commits from. I think this would help with reviewing as well as with judging the tradeoff between added complexity of some of the changes vs. improved performance.

  • Python 3.8-3.11 performance improvements: You mentioned that this mainly comes from 933c045. This could be a separate PR, easy to merge, and already a big win for users.
  • from_dict performance improvements: Do you have a feeling for which changes lead to this improvement? Is it everything in the current PR or can it be boiled down to fewer more isolated changes?

I hope this is ok for you and that you understand my position, i.e. that it does not feel random. We can also do more pre-alignment on rough drafts before starting to work on detailed PRs. Also something we can pick up in our upcoming call.

@dangotbanned dangotbanned marked this pull request as draft September 22, 2024 16:49
@dangotbanned
Copy link
Member Author

#3547 (comment)

I hope this is ok for you and that you understand my position, i.e. that it does not feel random.

@binste
Honestly I would say the main benefit of this PR was for me to understand this part of altair a little better.
That is, the process of working on it I see as most valuable - so please don't worry I totally understand that it isn't an easy review.

Would you be willing to split this PR into multiple ones?

Yeah that sounds like a good idea.
I will need to have a think about how best to do that, but at the very least I could get to the same outcome with a cleaner history and less experimentation.

I'll loop back to this at some point and see what smaller PRs I can make out of it.

Appreciate your feedback on this, regardless of the outcome

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants